1. INTRODUCTION

An essential challenge for traditional classification algorithms is data whose class distribution is
imbalanced. For example, legitimate transactions far outnumber fraudulent ones. In this situation,
classifiers tend to favor the majority class and ignore the minority one. Classical approaches to imbalanced
data classification fall into three categories. The first comprises algorithm-level methods, which
strengthen the classification algorithm to steer learning towards the minority samples [1], [2].
The second comprises ensemble classifiers, which include two methods [3], [4]: bagging and
boosting. Bagging trains several classifiers on subsets of the dataset [5]. In boosting, the complete
dataset is used to train the classifiers so that more attention is given to the samples that
are misclassified [6], [7]. The third category pre-processes the data to balance it before it is provided
as input to the classifiers. These data-level techniques are often preferred because they are widely
applicable [8], [9].

The major objective of data-level algorithms is to either decrease or increase the number of samples in a
class so that both classes end up with the same number of samples. The under-sampling approach
reduces the instances of the majority class. This technique may discard useful information that could
be essential for classifiers. Moreover, the reduced sample can represent the population inaccurately. The
over-sampling approach increases the size of the minority class by replicating samples [10], [11]. Unlike
under-sampling, this approach leads to no information loss. However, it increases the probability of
overfitting because the minority class samples are duplicated [12], [13].

To overcome the challenges of under- and over-sampling algorithms, some researchers have proposed
combining them with other techniques. Khan et al. [14] described a cost-sensitive method based on neural
networks that learns feature representations for both the minority and the majority class. This method
improves the classifier, so the original data are left unchanged. Castellanos et al. [15] suggested a
strategy based on string conversion, which carries the SMOTE technique over to a string space. According
to the F-measure score, the improvement of this method is 97.5.

Many researchers have applied clustering to balance the classes. Prachuabsupakij et al. [16] suggested
an approach in which a k-means based method decreases the overlap between the classes.
This method first splits the original dataset into two clusters. Then, a cluster switching method and the
SMOTE technique are applied to each cluster. The output is two balanced training sets. This model splits
the majority class into two clusters without regard to the size of the minority class. The average
F-measure of this model was 0.975. Czarnowski et al. [17] presented a clustering-based approach in which a
similarity coefficient is computed for the samples of each class independently. Then, similar samples are
grouped into the same cluster. The maximum accuracy of this method is 98.01. Lin et al. [18] proposed an
under-sampling method that uses clustering to divide the majority data into k clusters. This algorithm
chooses k with regard to the minority class size. The best average classification accuracy is 0.904.

This paper introduces a combined algorithm to overcome the class imbalance problem. It generates
new minority class samples with the SMOTE method and uses a clustering algorithm to reduce
the majority class. Unlike previous approaches, it clusters the majority class with regard to the size of
the new minority class. The novelty of this work is that the rate of increase of the minority class and the
rate of decrease of the majority class are determined together. The paper is organized as follows: the next
section presents the relevant background techniques and the proposed algorithm, section 3 provides the
results of the experiments, and section 4 gives the concluding points.


2. RESEARCH METHOD 

Before the proposed algorithm is introduced, the necessary background is summarized. This work
uses the SMOTE technique to increase the number of minority class samples. Moreover, the
approach uses a clustering method as an under-sampling algorithm to decrease the majority class.


2.1. SMOTE technique 

This algorithm applies over-sampling to balance imbalanced data [19]. Its main idea is to produce
synthetic samples. A new instance is created by interpolating between minority class samples that are
neighbors in feature space. Therefore, it operates in feature space rather than in data space. In other
words, the method considers both the values of the features and the relationships between
them [12]. Figure 1 depicts a simple example of SMOTE. First, a minority class sample yi is chosen as the
basis for a new synthetic point. Then, its nearest neighbors under a distance metric are found. Finally,
k of these neighbors are selected at random to obtain the new samples by interpolation (yj to yx). The
difference between the chosen instance and each selected neighbor is multiplied by a random coefficient
between 0 and 1 and added to the chosen instance. Consequently, several new points are produced, one of
which is chosen at random (rd; to rdx).

Figure 1. Producing the new SMOTE point
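The interpolation step described above can be sketched in a few lines of plain Python. This is only an illustration, not the implementation used in the experiments: the function name `smote_samples`, its parameters, and the toy data are assumptions, and the Euclidean distance is assumed as the distance metric.

```python
import math
import random

def smote_samples(minority, base_index, k, n_new, rng):
    """Create n_new synthetic points around minority[base_index]."""
    base = minority[base_index]
    # Rank the other minority samples by Euclidean distance to the base sample.
    others = [p for i, p in enumerate(minority) if i != base_index]
    neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
    synthetic = []
    for _ in range(n_new):
        neighbor = rng.choice(neighbors)   # pick one of the k nearest neighbors
        gap = rng.random()                 # random coefficient in [0, 1)
        # Move from the base sample towards the neighbor by the fraction `gap`.
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Toy minority class: three nearby samples and one distant sample.
minority = [[1.0, 1.0], [1.5, 1.2], [2.0, 1.0], [8.0, 9.0]]
new_points = smote_samples(minority, base_index=0, k=2, n_new=3,
                           rng=random.Random(0))
```

Every synthetic point lies on the line segment between the base sample and one of its two nearest neighbors, so the distant sample at (8, 9) never influences the result.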


2.2. K-means algorithm 
The k-means method is widely used in machine learning. It is an iterative technique that
divides the dataset into k distinct clusters such that each data item belongs to exactly one cluster [20].
Moreover, the sum of the distances between the data points and the centroid of their cluster
should be minimal. In other words, the variation within each cluster is kept small. The phases of the
algorithm are:

- Choose the number of clusters (k)

- Select k centroids randomly from the data points

- Keep iterating the following two steps until the centroids no longer change

- Calculate the distances between the data points and the centroids, and assign each point to its nearest centroid

- Update the centroid of each cluster to the average of the data points in that cluster


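$$ J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2 \qquad (1) $$

where C_j is the j-th cluster and c_j is its centroid.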
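As a minimal sketch of these phases, the following plain-Python function runs the assignment and update steps until the cluster assignments stop changing and then evaluates the within-cluster sum of squared distances. The function name `kmeans`, the `max_iter` safeguard, and the handling of empty clusters are assumptions made for illustration.

```python
import math
import random

def kmeans(points, k, rng, max_iter=100):
    """Lloyd's iteration; returns (centroids, labels, J)."""
    centroids = [list(p) for p in rng.sample(points, k)]   # k random data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:          # assignments, hence centroids, are stable
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                   # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    # Objective: sum of squared distances to the assigned centroid.
    J = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, J

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids, labels, J = kmeans(points, k=2, rng=random.Random(0))
```

On this toy data the two well-separated groups end up in different clusters, and J is the sum of the squared distances of the six points to their group means.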


2.3. Proposed algorithm 

This paper combines the under- and over-sampling approaches. The proposed method
applies the SMOTE algorithm to the minority class to increase its samples. Moreover, it uses a clustering
method to split the majority class into smaller groups without discarding data. The phases of the
algorithm are as follows:
- Perform the SMOTE technique on the minority class
- Compute the number of clusters as the ratio of the majority class size to the size of the new minority class
- Perform the k-means algorithm on the majority class
- Combine each cluster with the new minority class
- Train a classifier on each combined set
- Classify by maximum probability vote

Figure 2 shows the flow chart of the proposed algorithm. The method increases the minority class
by considering the imbalance ratio (IR) of the dataset. The number of clusters (K) is determined from the
new minority size and the majority size: K is equal to the size of the majority class divided by the size
of the new minority class. In the next step, the k-means algorithm is performed on the majority class to
produce K clusters. Then, each cluster is combined with the new minority class. A classifier is trained on
each combined set. Finally, the model selects the output with the maximum probability vote.

Figure 2. Flow chart of the proposed algorithm


3. RESULTS AND DISCUSSION

3.1. Dataset and experimental setting 

The experimental datasets are all from the KEEL repository [21]. The datasets have various
imbalance ratios (IR). The number of samples ranges from 214 to 5472. Table 1 shows the
characteristics of the datasets.


Table 1. Datasets information 


No.  Dataset        Attributes  Number of Samples  IR
1    glass1         9           214                1.82
2    pima           8           768                1.87
3    vehicle3       18          846                2.99
4    ecoli1         7           336                3.36
5    new-thyroid1   5           215                5.14
6    ecoli2         7           336                5.46
7    page-blocks    10          5472               8.79
8    yeast6         8           1484               41.4
9    poker-8-vs-6   10          1477               85.88
10   abalone19      8           4174               129.44
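To make the IR column concrete, the class sizes can be recovered from the number of samples and the IR. The short sketch below assumes the usual definition, IR = majority size / minority size; the helper name `class_sizes` is hypothetical.

```python
def class_sizes(n_samples, ir):
    """Recover (minority, majority) counts from the total and the imbalance ratio."""
    # With IR = majority / minority and n = majority + minority,
    # the minority count is n / (1 + IR), rounded to a whole sample.
    minority = round(n_samples / (1.0 + ir))
    return minority, n_samples - minority

glass1 = class_sizes(214, 1.82)        # (76, 138)
abalone19 = class_sizes(4174, 129.44)  # (32, 4142)
```

Under this assumption, abalone19 has only 32 minority samples out of 4174, which is why accuracy alone says little about how well the minority class is recognized.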


This paper applies the SMOTE algorithm to increase the minority instances. The number of
over-sampled instances is determined according to the IR of the dataset. Then, the number of clusters (k)
is calculated from the size of the new minority class and the size of the majority class. In the next step,
the k-means method produces the clusters. Then, each cluster is combined with the new minority class.
Classification is carried out on each new dataset. Finally, voting selects the best one. Figure 3 presents
the steps of the proposed algorithm in detail. To evaluate classification with the proposed algorithm, four
different classifiers were used: decision tree [22], support vector machine (SVM) [23], nearest neighbor
classifiers [24], and ensemble classifiers [19].


The proposed Algorithm 


1) Given I = {(x1, y1), ..., (xn, yn)}, xi ∈ X, |Y| = 2
2) Mi = the minority class size
3) Mj = the majority class size
4) IR = Mj / Mi
5) K is the number of clusters
6) New_Mi = SMOTE(Mi)
7) K = Mj / New_Mi
8) C1..K = Kmeans(Mj, K)
9) For i = 1 to K do
10) {
11) SCi = New_Mi ∪ Ci
12) vi = baseclassifier(SCi)
13) }

Output: V = voting vi regarding maximum probability


Figure 3. The proposed algorithm pseudo code 
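Steps 7 to 13 of the pseudo code and the final vote can be sketched as follows, taking the over-sampled minority class and the majority clusters as given. Several parts are assumptions rather than details from the paper: `CentroidClassifier` is a hypothetical stand-in for the base classifier, K is rounded to the nearest integer of at least 1, and the "maximum probability vote" is read as choosing the label that receives the single highest class probability across all classifiers.

```python
import math

class CentroidClassifier:
    """Hypothetical stand-in for the base classifier: nearest class centroid."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_proba(self, x):
        # Softmax over negative distances: closer centroid, higher probability.
        weights = {c: math.exp(-math.dist(x, m)) for c, m in self.centroids.items()}
        total = sum(weights.values())
        return {c: w / total for c, w in weights.items()}


def number_of_clusters(majority_size, new_minority_size):
    # K = Mj / New_Mi, rounded and kept at 1 or more (rounding is an assumption).
    return max(1, round(majority_size / new_minority_size))


def train_per_cluster(clusters, new_minority, make_classifier=CentroidClassifier):
    """Fit one classifier on each set SCi = New_Mi united with Ci."""
    models = []
    for cluster in clusters:
        X = list(cluster) + list(new_minority)
        y = [0] * len(cluster) + [1] * len(new_minority)   # 0 = majority, 1 = minority
        models.append(make_classifier().fit(X, y))
    return models


def max_probability_vote(models, x):
    """Return the label backed by the single highest class probability."""
    best_label, best_prob = None, -1.0
    for model in models:
        for label, prob in model.predict_proba(x).items():
            if prob > best_prob:
                best_label, best_prob = label, prob
    return best_label


# Toy input: two majority clusters (for example from k-means) and the
# minority class after over-sampling.
clusters = [[[0.0, 0.0], [0.5, 0.2]], [[0.0, 5.0], [0.4, 5.5]]]
new_minority = [[5.0, 2.5], [5.5, 2.0]]
models = train_per_cluster(clusters, new_minority)
prediction = max_probability_vote(models, [5.2, 2.3])
```

Because every classifier sees one majority cluster together with the whole new minority class, each training set is roughly balanced while no majority sample is thrown away.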


3.2. Evaluation methods 
Accuracy is the number of correctly predicted instances divided by the total number of instances:

$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (2) $$

where: FP (false positive) is an outcome that indicates a positive when the instance is actually negative
FN (false negative) is an outcome that indicates a negative when the instance is actually positive
TP (true positive) is an outcome that indicates a positive when the instance is actually positive
TN (true negative) is an outcome that indicates a negative when the instance is actually negative

If the class distribution is unbalanced, however, accuracy can be misleading. Therefore, it is better to
rely on precision and recall. Likewise, a precision-recall curve is suitable for evaluating a classifier on
imbalanced classes. Moreover, the area under the curve (AUC) is a performance measure for the
classification method.

$$ \text{Precision} = \frac{TP}{TP + FP} \qquad (3) $$

$$ \text{Recall} = \frac{TP}{TP + FN} \qquad (4) $$
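A small worked example shows why accuracy can mislead on imbalanced data. The counts below are invented for illustration (990 majority and 10 minority samples, with a classifier that predicts the majority class for everything), and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    # Convention (an assumption): report 0.0 when nothing was predicted positive.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

# Every sample is predicted "negative": no true or false positives.
tp, tn, fp, fn = 0, 990, 0, 10
acc = accuracy(tp, tn, fp, fn)   # 0.99
rec = recall(tp, fn)             # 0.0
prec = precision(tp, fp)         # 0.0 by the convention above
```

The classifier reaches 99% accuracy although it never detects a single minority sample, which the recall of 0 makes visible.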


Table 2 presents the accuracy of the approaches on the 10 datasets, and Table 3 shows the AUC
for all datasets. To compare the performance of the suggested approach, the classification results were
evaluated for the original (normal) dataset, the dataset after applying SMOTE, a density-based
under-sampling algorithm (DBU) [25], and the proposed method. As mentioned before, it is
easy to obtain a high accuracy without actually making useful predictions when the classes are imbalanced.
Therefore, precision and recall were computed. Then, the area under the precision-recall curve (AUC) was
used as a summary of the model performance. A precision-recall curve shows the trade-off between
the true positive rate (recall) and the positive predictive value (precision) at different probability
thresholds.
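The paper does not state how the area under the precision-recall curve is computed. One common estimate is the step-wise sum known as average precision, sketched below under that assumption; the function names are hypothetical, the labels use 1 for the minority (positive) class, and the scores are assumed to be distinct.

```python
def pr_curve(labels, scores):
    """Recall and precision after each threshold, from the highest score down."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    tp = fp = 0
    points = []
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / n_pos, tp / (tp + fp)))   # (recall, precision)
    return points

def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve."""
    area, prev_recall = 0.0, 0.0
    for rec, prec in pr_curve(labels, scores):
        area += (rec - prev_recall) * prec   # width in recall times precision
        prev_recall = rec
    return area

labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(labels, scores)
```

Lowering the threshold past the second positive raises recall from 0.5 to 1 while precision drops to 2/3, so the area here is 0.5 * 1 + 0.5 * 2/3. A classifier that ranks every positive above every negative would score 1.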


Table 2. Accuracy for all datasets 


No.  Dataset        Normal (%)  SMOTE (%)  The proposed method (%)
1    glass1         84.3        93.4       98.9
2    pima           71.5        83.9       83.9
3    vehicle3       85.2        88.7       93
4    ecoli1         90.2        92.5       99.3
5    new-thyroid1   99.1        99.2       99.6
6    ecoli2         96.1        97.2       97.8
7    page-blocks    89.3        98         99.3
8    yeast6         98.4        96.3       97.7
9    poker-8-vs-6   99.6        99.7       99.8
10   abalone19      99.2        99.1       99.6

Table 3. AUC for all datasets 
No.  Dataset        Normal    SMOTE     DBU       The proposed method
1    glass1         0.916095  0.989536  0.954325  0.994768
2    pima           0.830507  0.912511  0.736557  0.912511
3    vehicle3       0.899753  0.943608  0.758543  0.976075
4    ecoli1         0.972547  0.973063  0.797543  0.989592
5    new-thyroid1   0.982937  0.999921  0.985654  0.999996
6    ecoli2         0.946167  0.990605  0.984432  0.992727
7    page-blocks    0.927108  0.998655  0.943543  0.998866
8    yeast6         0.968653  0.969167  0.859876  0.978043
9    poker-8-vs-6   0.881668  0.969903  0.943754  0.998708
10   abalone19      0.608115  0.923286  0.652543  0.990296


The proposed algorithm combines over-sampling and under-sampling techniques. The rate at which the
majority class is reduced is set by considering the rate at which the minority class is increased.
Therefore, the proposed method uses more instances of the original data. Moreover, the rate of increase of
the minority class depends on the IR of the dataset. The results show increased accuracy and AUC for the
proposed model compared to the SMOTE method on benchmark imbalanced datasets from the KEEL repository.


4. CONCLUSION 

This work proposes a novel technique for handling imbalanced data. To overcome the imbalance problem,
both under- and over-sampling approaches are used. Most imbalanced data classification techniques try to
balance the data by increasing the minority class or decreasing the majority class, which changes
the original data. The proposed algorithm tries to reduce the amount of change to the original dataset. The
algorithm first applies the SMOTE method to the minority class to produce new instances. Then, the
k-means clustering algorithm splits the majority class, choosing k with regard to the size of the new
minority class. Finally, each cluster together with the new minority class is used as the input data of a
classifier. The experimental results show increased accuracy and AUC for the proposed model compared to
the SMOTE method on benchmark imbalanced datasets from the KEEL repository. The method can be further
applied to other real-world engineering datasets.